{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Time Series Modeling\n",
    "\n",
    "2017-04-28, Josh Montague\n",
    "\n",
    "This is Part 2 of the [\"Time Series Modeling in Python\" series](https://github.com/DrSkippy/Data-Science-45min-Intros/tree/master/time-series). \n",
    "\n",
    "In [Part 1](https://github.com/DrSkippy/Data-Science-45min-Intros/blob/master/time-series/01%20-%20Time%20series%20data%20in%20pandas.ipynb), we looked at data structures within the `pandas` library that make working with time series particularly convenient. \n",
    "\n",
    "Here, in Part 2, we'll look at the some of the simplest methods for modeling a time series, making forecasts, and evaluating the accuracy of our models. We'll take advantage of both `pandas` and `numpy` functionality since they're both great for these sorts of tasks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "import copy\n",
    "from IPython.display import Image\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import random\n",
    "from sklearn.metrics import r2_score, mean_squared_error"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "plt.rcParams[\"figure.figsize\"] = (8,6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# `unemploy.csv` is included in the repo - it's a small csv\n",
    "! head misc/unemploy.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# get data from local dir\n",
    "data = pd.read_csv('misc/unemploy.csv', parse_dates=True, index_col=0)    \n",
    "\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# this is for the data we have locally\n",
    "data_name = 'UNEMPLOY'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "----------------\n",
    "\n",
    "`< optional web data acquisition >`\n",
    "\n",
    "If you want to experiment with other easily accessible data..."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# first, change this cell type to \"code\"\n",
    "! pip install pandas-datareader"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "source": [
    "# choose a different data_name to grab any other data set by name, \n",
    "#   which you can find here: https://fred.stlouisfed.org/\n",
    "#   and then update the `data_name` variable in the previous cell\n",
    "import pandas_datareader.data as web\n",
    "\n",
    "start = pd.to_datetime('2010-01-01')\n",
    "end = pd.to_datetime('2017-01-01')\n",
    "\n",
    "data = web.DataReader(data_name, 'fred', start, end)\n",
    "\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "`</ data acquisition >`\n",
    "\n",
    "--------------------"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# what does this data look like?\n",
    "data.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Simple models \n",
    "\n",
    "\n",
    "Let's start with the simplest set of things we can possibly do in our forecasting model[1]. These models are computationally simple, require only a small amount of data to make predictions, and generally provide a baseline over which to compare more complicated models.\n",
    "\n",
    "Generally, we will want some form of a `forecast()` method. \n",
    "\n",
    "But we'll get to abstractions in a moment. First, let's start by just doing it directly. \n",
    "\n",
    "## Naive method ( + `.shift()`)\n",
    "\n",
    "Let's start with the most straightforward and obvious forecasting method possible: the next data point will be the same as the previous data point. This approach (reasonably) assumes that in many real world systems there is some form of inertia or momentum in the underlying processes that is greater than the associated random fluctuations. \n",
    "\n",
    "In form, what we'd like is a DataFrame with a column of observed data and column of forecasted data. \n",
    "\n",
    "To start, our forecasted column will be a shifted version of the existing values column. We can do that with `pandas`' `.shift()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# check the top of the dataset\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# what does the shift() method do?\n",
    "# (remember that pandas methods return a new df)\n",
    "data.shift().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# what happens at the end of the series?\n",
    "data.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# 'periods=1' is the default first arg\n",
    "data.shift(periods=1).tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Note that the `shift()` method effectively *slides the index up (relative to the data),  keeping the same index range but clipping or extending the data column as needed.*\n",
    "\n",
    "What if you want the data column to have the same range, but shift the *index*? For this case, the `shift()` method has a `freq` kwarg to use instead of the (implicit) `periods` kwarg."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# a timely reminder of bit.ly/pd-offsets\n",
    "data.shift(freq='2D').head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Note that the data column stayed fixed, but we adjusted each value in the index according to the `freq` kwarg while maintaining the period e.g. one month."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "data.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "data.shift(freq='1M').tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "source": [
    "Ok, great, now that we can simply shift the existing data column; let's attach that our original frame as our first \"forecast\" series."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# use a copy for our forecast frame\n",
    "d = copy.deepcopy(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# assign the \"forecast\" to be the 1-period shifted version of the \"observed\" column\n",
    "d['forecast'] = data.shift(1)[data_name]\n",
    "\n",
    "d.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We made a forecast!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "Image(filename='misc/kermit.png')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Now how do we assess our model? \n",
    "\n",
    "It's always a good idea to do a visual inspection of the data. Both in the original of the data (time series), and also in the space of your truth vs predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# how does our forecast look by eye?\n",
    "d.plot()\n",
    "\n",
    "plt.title('naive')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Not so bad! Certainly better than a random number generator. \n",
    "\n",
    "Another common way to inspect a prediction (in any supervised learning task) is to plot the true data against the predicted data and fit a line through it. In the case where `prediction = truth`, this will be a beautiful, straight line, with zero residual error across the entire data set. \n",
    "\n",
    "If you do see a beautiful line like this, you most likely made a mistake somewhere; the real world is messy. Double check that you didn't accidentally train on your hold-out data, or evaluate on your training data, or something similar. \n",
    "\n",
    "In the real world (if your model is good), your plot should appear roughly linear, and should have some variance around the line you would draw through the data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "plt.scatter(d[data_name], d[data_name])\n",
    "\n",
    "plt.xlabel('truth')\n",
    "plt.ylabel('also truth')\n",
    "plt.title('this will never happen');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "plt.scatter(d[data_name], d['forecast'])\n",
    "\n",
    "plt.xlabel('truth')\n",
    "plt.ylabel('forecast')\n",
    "plt.title(\"variance is a sign that you're alive\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Looks pretty good! We'll come back to a more **quantitative assessment** of this prediction in a bit. \n",
    "\n",
    "Before we move to other models, let's convert the naive model to a functional form. This way, as we develop more forecasting models, we can use the same API for consistency.\n",
    "\n",
    "### `numpy` > `pandas`\n",
    "\n",
    "While we typically have our data in DataFrames, we'll see that most (all?) of our forecasting methods don't make use of the extra metadata in a DataFrame or Series. As a result, we'll define our forecasting models with the expectaion of numpy arrays as input (which are more general), and we'll know that we can always use the `.values` attribute to get the respective numpy array data out of a `pandas` object.\n",
    "\n",
    "Here's an implementation of the naive model in functional form. It's a lot of typing for a simple model, but the pattern will prove useful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "def fc_naive(data, **kwargs):\n",
    "    \"\"\"The 'naive' forecast of the next point in `data` (presumed to be \n",
    "    ordered in time) is equal to the last point observed in the series.\n",
    "    \n",
    "    `data` should be a 1-D numpy array\n",
    "    \n",
    "    Returns a single-valued forecast for the next value in the series.\n",
    "    \"\"\"\n",
    "    forecast = data[-1]\n",
    "    return forecast"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "We can use this function to create a similar DataFrame to the one we used earlier. We'll loop over the observed data, and use the past data as our input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# container for the forecast\n",
    "forecasts = []\n",
    "\n",
    "# loop over positions in the array\n",
    "for idx in range(len(data[data_name])):\n",
    "    # subset the series from beginning to position idx\n",
    "    array_slice = data[data_name].iloc[:idx].values\n",
    "    if idx < 10:\n",
    "        # a view behind the scenes...\n",
    "        print('iteration {}, array_slice: {}'.format(idx, array_slice))\n",
    "    # make a forecast using that series\n",
    "    try:\n",
    "        forecasts.append( fc_naive(array_slice) )\n",
    "    except IndexError:\n",
    "        # the first position won't have a forecast value\n",
    "        forecasts.append(np.nan) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "d = copy.deepcopy(data)\n",
    "d['forecast'] = forecasts\n",
    "\n",
    "d.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "d.plot()\n",
    "\n",
    "plt.title('same naive graph');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "NOW FOR MOAR MODELS!\n",
    "\n",
    "\n",
    "## Seasonal naive method\n",
    "\n",
    "Instead of just forecasting based on the previous point, we might know there is a consistent cycle within the data. Often, the term \"seasonal\" is used to describe any sort of cyclic behavior. Seems sloppy, but oh well. In this case, we can prescribe how far back to look in the series, and forecast the corresponding value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "def fc_snaive(data, n=7, **kwargs):\n",
    "    \"\"\"The 'seasonal naive' forecast of the next point in `data` (presumed to be \n",
    "    ordered in time) is equal to the point observed `n` points prior in the series.\n",
    "    \n",
    "    `data` should be a 1-D numpy array\n",
    "    `n` should be an integer\n",
    "    \n",
    "    Returns a single-valued forecast for the next value in the series.\n",
    "    \"\"\"\n",
    "    forecast = data[-n]\n",
    "    return forecast"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Note that we aren't paying any attention to the observed resolution of the data, only the relative position of any cyclic behavior.\n",
    "\n",
    "Since we want to create a new forecast array (as before), let's make a function that encapsulates the iteration over an observed array and returns the corresponding forecast array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "def forecast_series(observed, fc_func, **kwargs):\n",
    "    \"\"\"Returns an array of forecasted values (using `fc_func` and any `kwargs` like a window `n`) \n",
    "    for each value in the input np.array `observed`. \n",
    "    \"\"\"\n",
    "    # container for the forecast\n",
    "    forecasts = []\n",
    "\n",
    "    for idx in range(len(observed)):\n",
    "        # subset the series from beginning to position idx\n",
    "        array_slice = observed[:idx]\n",
    "        # make a forecast using that series\n",
    "        try:\n",
    "            forecasts.append( fc_func(array_slice, **kwargs) )\n",
    "        except IndexError:\n",
    "            # the first position won't have a forecast value\n",
    "            forecasts.append(np.nan) \n",
    "    return forecasts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "d = copy.deepcopy(data)\n",
    "\n",
    "# our data is monthly, and i have a hunch about quarterly cycles, so let's use n=3 (3 months in a quarter)\n",
    "forecasts = forecast_series(d[data_name].values, fc_snaive, n=3)\n",
    "\n",
    "d['forecast'] = forecasts\n",
    "\n",
    "d.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "d.plot()\n",
    "\n",
    "plt.title('seasonal naive (n=3)')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "plt.scatter(d[data_name], d['forecast'])\n",
    "\n",
    "plt.xlabel('truth')\n",
    "plt.ylabel('forecast')\n",
    "plt.title('seasonal naive method')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Mean method\n",
    "\n",
    "Another straightforward model consists of averaging the past `n` points and forecasting using that mean value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "def fc_mean(data, n=3, **kwargs):\n",
    "    \"\"\"The 'mean' forecast of the next point in `data` (presumed to be \n",
    "    ordered in time) is equal to the mean value of the most recent `n` observed points.\n",
    "    \n",
    "    `data` should be a 1-D numpy array\n",
    "    `n` should be an integer\n",
    "    \n",
    "    Returns a single-valued forecast for the next value in the series.\n",
    "    \"\"\"\n",
    "    # don't start averaging until we've seen n points\n",
    "    if len(data[-n:]) < n:\n",
    "        forecast = np.nan\n",
    "    else:\n",
    "        # nb: we'll keep the forecast as a float\n",
    "        forecast = np.mean(data[-n:])\n",
    "    return forecast"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "d = copy.deepcopy(data)\n",
    "\n",
    "# let's try a 4-point rolling mean\n",
    "forecasts = forecast_series(d[data_name].values, fc_mean, n=4)\n",
    "\n",
    "d['forecast'] = forecasts\n",
    "\n",
    "d.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "d.plot()\n",
    "\n",
    "plt.title('mean forecast (n=3)');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "plt.scatter(d[data_name], d['forecast'])\n",
    "\n",
    "plt.xlabel('truth')\n",
    "plt.ylabel('forecast')\n",
    "plt.title('mean method')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## Drift method\n",
    "\n",
    "The last simple model we'll look at involves a linear extrapolation from recently observed data. In this case, the prediction is an adjustment from the most recent observed point, according to the slope extrapolated from two points: one at `n` points prior, and the most recent point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "def fc_drift(data, n=3, **kwargs):\n",
    "    \"\"\"The 'drift' forecast of the next point in `data` (presumed to be \n",
    "    ordered in time) is a linear extrapoloation from `n` points ago, through the\n",
    "    most recent point.\n",
    "    \n",
    "    `data` should be a 1-D numpy array\n",
    "    `n` should be an integer\n",
    "    \n",
    "    Returns a single-valued forecast for the next value in the series.\n",
    "    \"\"\"\n",
    "    yi = data[-n]\n",
    "    yf = data[-1]\n",
    "    slope = (yf - yi) / (n-1)\n",
    "    forecast = yf + slope\n",
    "    return forecast"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "d = copy.deepcopy(data)\n",
    "\n",
    "# let's try a 5-point drift method\n",
    "forecasts = forecast_series(d[data_name].values, fc_drift, n=5)\n",
    "\n",
    "d['forecast'] = forecasts\n",
    "\n",
    "d.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "d.plot()\n",
    "\n",
    "plt.title('drift method');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "plt.scatter(d[data_name], d['forecast'])\n",
    "\n",
    "plt.xlabel('truth')\n",
    "plt.ylabel('forecast')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Model accuracy metrics\n",
    "\n",
    "At this point, we've looked at four different simple models: naive, seasonal naive, (rolling) mean, and drift. A next reasonable question is: which one best reflects our data? To answer that, we'll look to some metrics for model accuracy measurements.\n",
    "\n",
    "As is often the case, [scikit-learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) has a module that supplies a handful of these out of the box. The first thing you'll note is that there are more metrics for classification accuracy than for regression accuracy evaluation. Still, at least we don't have to reinvent the wheel!\n",
    "\n",
    "First, let's make some forecasts..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "d = copy.deepcopy(data)\n",
    "\n",
    "# feel free to tweak the 'n' args\n",
    "model_list = [('naive', fc_naive, 0),\n",
    "              ('seasonal_naive', fc_snaive, 3),\n",
    "              ('mean', fc_mean, 3),\n",
    "              ('drift', fc_drift, 5)]\n",
    "\n",
    "# create new cols for each model\n",
    "for name, model, nn in model_list:\n",
    "    d[name] = forecast_series(d[data_name].values, model, n=nn)    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "d.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "d.plot()\n",
    "\n",
    "plt.title('ALL THE FORECASTS!');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "for name, series_data in d.items():\n",
    "    plt.plot(d[data_name], series_data, 'o', alpha=0.6, label=name)\n",
    "\n",
    "plt.xlabel('truth')    \n",
    "plt.ylabel('pred')\n",
    "plt.title('another view')\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another view on these charts to quantify the quality of a model is to look at the distribution of residuals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "comparison = 'naive'\n",
    "\n",
    "(d[data_name] - d[comparison]).hist(bins=30)\n",
    "\n",
    "plt.xlabel('residuals')\n",
    "plt.title('residual distribution for method: {}'.format(comparison));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "They all look pretty good, but we can be more specific.\n",
    "\n",
    "\n",
    "### $R^2$ (coefficient of determination) score \n",
    "\n",
    "The $R^2$ score is a common regression scoring method. The user guide (and wikipedia) have nice explanations of both the defintion, and the domain; in short, the maximum value is 1.0, scoring a constant prediction that is the mean value of the observed data will give 0.0, and it can be arbitraily negative). The $R^2$ metric is basically what your eyeballs are assessing when you look at a scatter plot of the truth data vs. the predicted data. \n",
    "\n",
    "Let's look at the $R^2$ value for the models that we've introduced so far."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "print('* R2 scores (bigger = better) *\\n')\n",
    "\n",
    "# calculate R2 for each model (against the observed data)\n",
    "for name, series_data in d.items():\n",
    "    # strip rows with nans \n",
    "    subdf = d[[data_name, name]].dropna()\n",
    "    truth = subdf[data_name].values\n",
    "    pred = subdf[name].values\n",
    "    # calculate metric\n",
    "    r2 = r2_score(truth, pred)\n",
    "    print('{} - {:.4f}'.format(name, r2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Feel free to experiment (or BYO GridSearchCV), but I think you'll typically find these are all >0.95, and the naive model frequently has the highest value.\n",
    "\n",
    "\n",
    "### Mean squared error\n",
    "\n",
    "The mean squared error (MSE) is a simpler calculation that is the expected value of the quadratic error.\n",
    "\n",
    "$$ MSE(y,\\hat{y}) = \\frac{1}{n_{samples}} \\sum_i (y_i - \\hat{y}_i)^2 $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "print('* MAE scores (smaller = better) *\\n')\n",
    "\n",
    "# calculate MAE for each model (against the observed data)\n",
    "for name, series_data in d.items():\n",
    "    # strip rows with nans \n",
    "    subdf = d[[data_name, name]].dropna()\n",
    "    truth = subdf[data_name].values\n",
    "    pred = subdf[name].values\n",
    "    # calculate metric\n",
    "    mae = mean_squared_error(truth, pred)\n",
    "    print('{} - {:.4f}'.format(name, mae))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Since MAE isn't normalized, it's a little hard to eyeball (big numbers), and to compare different models. There are also mean absolute, and median absolute errors, if you feel like you want to \"penalize\" outliers in any particular fashion. \n",
    "\n",
    "I like $R^2$ because it's relateively straightforward, but it's good to know there are alternatives.\n",
    "\n",
    "\n",
    "# Wrap-up\n",
    "\n",
    "While none of these models are particularly fancy, they provide great baselines for any other model you can dream up. They are generally very fast and straightforward to evaluate, and provide a surprisingly high accuracy on many forecasting tasks. \n",
    "\n",
    "For some broader context on forecasting (if, sadly, written in `R`), check out Ref [1].\n",
    "\n",
    "Hopefully, there are still two classes of time series modeling that we'll look at in the future: so-called *\"state space models\"* (of which the most famous is the ARIMA family), and more recent, *neural network-based models*. When we do get to these models, now we can baseline them against these simple methods!\n",
    "\n",
    "\n",
    "--------\n",
    "\n",
    "\n",
    "# `pandas` Appendix\n",
    "\n",
    "There are a handful of `pandas` methods that are time-series related, but not necessarily about forecasting and this seemed like a good place to highlight them. In particular, if you'd like to transform a time series according to a rolling window average, or exponential smoothing, there are efficient built-in methods for that!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# recall our original 'data' dataframe\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### Window functions\n",
    "\n",
    "The [official docs](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-functions) include many examples, but a common pattern is creating a `Rolling` object - which has the notion of a window size, and a window type (square, triangular, etc.) - and then applying functions like you would after a groupby or a resample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# make a \"rolling\" object that we can use for calculations\n",
    "r = data.rolling(window=5)\n",
    "\n",
    "# this object can be treated much like a GroupBy object\n",
    "r"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# we can apply a number of methods to the Rolling object, like standard numerical calcs\n",
    "r.mean().head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "plt.plot(data, 'o--', label=data_name)\n",
    "plt.plot(r.mean(), '.-', label='rolling mean')\n",
    "\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "plt.plot(data, 'o--', label=data_name)\n",
    "plt.plot(r.max(), '.-', label='rolling max')\n",
    "\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# calculate stdev on the rolling object within window size\n",
    "stds = r.std()\n",
    "\n",
    "# add the stdevs as error bars on each point\n",
    "data.plot(style='o', yerr=stds)\n",
    "\n",
    "plt.title('observed data points + windowed stdev')\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### Exponentially weighted windows\n",
    "\n",
    "Finally, if you should want them, there are windows that have an exponentially decaying weight. These are relatively new to pandas, and have[questionable documentation](http://pandas.pydata.org/pandas-docs/stable/computation.html#exponentially-weighted-windows). But, they're configured by either a `span` (much like the rolling window), the decay constant `alpha`, or a couple of related. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "plt.plot(data, 'o-', label=data_name)\n",
    "plt.plot(data.ewm(span=5).mean(), '.--', label='EMW')\n",
    "\n",
    "plt.legend();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# References\n",
    "\n",
    "[1] [OText on Forecasting](https://www.otexts.org/fpp/2/3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
